Loop vectorization using SIMD instructions
The goal of this exercise is to speed up loop execution by streaming
floating-point data in parallel (four values at a time) through SIMD instructions.
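As a minimal sketch of what "four values at a time" means in practice, the loop below processes four floats per iteration with SSE intrinsics and finishes the remainder with a scalar tail. The function name scale_add_sse and its parameters are hypothetical and do not come from the exercise sources; the intrinsics themselves are the standard ones from xmmintrin.h and assume an x86 target.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#include <cstddef>

// Hypothetical helper: computes y[i] = a * x[i] + b for i in [0, n).
void scale_add_sse(const float* x, float* y, std::size_t n, float a, float b) {
    const __m128 va = _mm_set1_ps(a);  // broadcast a into all four lanes
    const __m128 vb = _mm_set1_ps(b);  // broadcast b into all four lanes
    std::size_t i = 0;
    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);  // unaligned load of 4 floats
        const __m128 vy = _mm_add_ps(_mm_mul_ps(va, vx), vb);
        _mm_storeu_ps(y + i, vy);               // unaligned store of 4 floats
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        y[i] = a * x[i] + b;
    }
}
```

Unaligned loads and stores are used here so that the sketch works for any pointer; with 16-byte aligned data, _mm_load_ps and _mm_store_ps could be used instead.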
- Vectorize the generation of normally (Gaussian) distributed random numbers
- Vectorize the chi2 calculation (including Kahan summation)
- Vectorize "vl_fast_atan2_f"
- Try to "convince" the compiler to vectorize loops using -ftree-vectorize
- Measure the speed-up
- Apply all this to the minimization example
- A long-term project: vectorize Wallace's normal random number generator in C++
Code
in include:
SSEArray.h
gaussian_ziggurat.h
in examples:
SSEMathFun_t.cpp ("the vectorized loop")
approxPhi.cpp (approximated atan2 to vectorize)
exRandom.cpp (example of random generators' usage and blending of a 16-wide vector)
Hints
pfmon --long-smpl-period=5000 --resolve-addresses --smpl-per-function --smpl-show-top=20 ./a.out
pfmon -e UNHALTED_CORE_CYCLES,ARITH:CYCLES_DIV_BUSY,SSEX_UOPS_RETIRED:SCALAR_SINGLE,SSEX_UOPS_RETIRED:PACKED_SINGLE ./a.out k
References
A partial SSE port of the cephes floating point algorithms
Mersenne Twister Home Page
Normal Distribution on Wikipedia
David B. Thomas; Philip G.W. Leong; Wayne Luk; John D. Villasenor (October 2007). "Gaussian Random Number Generators" (pdf)
A fast vectorised implementation of Wallace's normal random number generator
Random Number Generation on GPUs